ABSTRACT

In this paper we consider the problem of grouped variable selection in high-dimensional regression using $\ell_1$-$\ell_q$ regularization ($1 \le q \le \infty$), which can be viewed as a natural generalization of the $\ell_1$-$\ell_2$ regularization (the group Lasso). The key condition is that the dimensionality $p_n$ can increase much faster than the sample size $n$, i.e. $p_n \gg n$ (in our case $p_n$ is the number of groups), but the number of relevant groups is small. The main conclusion is that many good properties of $\ell_1$-regularization (Lasso) naturally carry over to the $\ell_1$-$\ell_q$ cases ($1 \le q \le \infty$), even if the number of variables within each group also increases with the sample size. With fixed design, we show that the whole family of estimators is both estimation consistent and variable selection consistent under different conditions. We also show the persistency result with random design under a much weaker condition. These results provide a unified treatment for the whole family of estimators ranging from $q = 1$ (Lasso) to $q = \infty$ (iCAP), with $q = 2$ (group Lasso) as a special case. When there is no group structure available, all the analysis reduces to the current results of the Lasso estimator ($q = 1$).



Keywords: $\ell_1$-$\ell_q$ regularization, $\ell_1$-consistency, variable selection consistency, sparsity oracle inequalities, rates of convergence, Lasso, iCAP, group Lasso, simultaneous Lasso
I. Introduction



We consider the problem of recovering a high-dimensional vector $\beta^* \in \mathbb{R}^{m_n}$ using a sample of independent pairs $(X_{1\cdot}, Y_1), \ldots, (X_{n\cdot}, Y_n)$ from a multiple linear regression model, $Y = X\beta^* + \epsilon$. Here $Y$ is the $n \times 1$ response vector and $X$ represents the observed $n \times m_n$ design matrix whose $i$-th row vector is denoted by $X_{i\cdot}$. $\beta^*$ is the true unknown coefficient vector that we want to recover, and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ is an $n \times 1$ vector of i.i.d. noise with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

In this paper we are interested in the situation where all the variables are naturally partitioned into $p_n$ groups. Grouped variables often appear in real world applications. For example, in many data mining problems we encode categorical variables using a set of dummy variables and as a result they form a group. Another example is the additive model, where each component function can be represented using its basis expansion, which can be treated as a group. Suppose the number of variables in the $j$-th group is represented by $d_j$; then by definition we have $m_n = \sum_{j=1}^{p_n} d_j$. We can rewrite the above linear model as

$$Y = X\beta^* + \epsilon = \sum_{j=1}^{p_n} X_j \beta_j^* + \epsilon \tag{1.1}$$

where $X_j$ is an $n \times d_j$ matrix corresponding to the $j$-th group (which could be either categorical or continuous) and $\beta_j^*$ is the corresponding $d_j \times 1$ coefficient subvector. Therefore, we have $X = (X_1, \ldots, X_{p_n})$ and $\beta^* = (\beta_1^{*T}, \ldots, \beta_{p_n}^{*T})^T$. All predictors and the response variable are assumed to be centered at zero to simplify notation. Furthermore, we use $x_j$ to represent the $j$-th column in the design matrix $X$ and assume that all columns in the design matrix are standardized, i.e. $\frac{1}{n}\|x_j\|_{\ell_2}^2 = 1$, $j = 1, \ldots, m_n$. Similar to the notation of $x_j$, we denote by $\beta_j^*$ ($j = 1, \ldots, m_n$) the $j$-th individual element of the vector $\beta^*$. Since we are mainly interested in the high-dimensional setting, we allow the number of groups $p_n$ to increase as the number of examples $n$ increases, and our results mainly focus on the case where $p_n \gg n$. Furthermore, we also allow the group size $d_j$ to increase with $n$ at a rate $d_j = o(n)$ and define $d_n = \max_j d_j$ to be the upper bound of the group size for a fixed $n$. In the rest of the paper we will suppress the subscript $n$ when there is no confusion.

In order to obtain a reliable estimate of $\beta^*$ when $p_n \gg n$, the key assumption is that the true coefficient vector $\beta^*$ is sparse. Denote by $S = \{j : \|\beta_j^*\|_{\ell_\infty} \neq 0,\ j = 1, \ldots, p_n\}$ the set of relevant group indices and let $s_n = |S|$ be the cardinality of the set $S$. We also denote by $\beta_S^*$ the vector concatenating all subvectors $\beta_j^*$ for $j \in S$. The sparsity assumption means that $s_n \ll p_n$. Therefore, even if $\beta^*$ has a very high dimension, the only effective part is $\beta_S^*$ while the remaining part $\beta_{S^c}^* = 0$. Our task is to select and recover the nonzero groups of variables corresponding to the index set $S$.
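The setup above can be made concrete with a small simulation. The following is a minimal sketch, not part of the original analysis: the function name `simulate_grouped_design`, the Gaussian design, and the uniform range for the nonzero coefficients are all illustrative assumptions. It draws data from model (1.1) with centered, standardized columns and a group-sparse $\beta^*$.

```python
import numpy as np

def simulate_grouped_design(n, group_sizes, active_groups, sigma=1.0, seed=0):
    """Draw (X, Y, beta_star, groups) from Y = X beta* + eps with grouped columns.

    Hypothetical helper for illustration; the Gaussian design is an assumption.
    """
    rng = np.random.default_rng(seed)
    m = int(sum(group_sizes))                      # m_n = sum_j d_j
    bounds = np.cumsum([0] + list(group_sizes))
    # groups[j] holds the column indices belonging to the j-th group
    groups = [np.arange(bounds[j], bounds[j + 1]) for j in range(len(group_sizes))]

    X = rng.standard_normal((n, m))
    X -= X.mean(axis=0)                            # center every predictor
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)    # enforce (1/n) * ||x_j||_2^2 = 1

    beta_star = np.zeros(m)
    for j in active_groups:                        # only groups in S carry signal
        beta_star[groups[j]] = rng.uniform(0.5, 1.5, size=len(groups[j]))

    Y = X @ beta_star + sigma * rng.standard_normal(n)
    Y -= Y.mean()                                  # center the response
    return X, Y, beta_star, groups
```

For instance, `simulate_grouped_design(40, [3, 2, 4], [0, 2])` gives $p_n = 3$ groups, $m_n = 9$ columns, and $S = \{0, 2\}$ in zero-based indexing.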

Sparsity has a long history of successes in solving such high-dimensional problems. Without considering the group structure, there exist many classical methods for variable selection, such as AIC (Akaike, 1973), BIC (Schwarz, 1978), Mallows' $C_p$ (Mallows, 1973), etc. Although these methods have been proven to be theoretically sound and have been shown to perform well in practice, they are only computationally feasible when the number of variables is small. Recently, more attention has been focused on the $\ell_1$-regularized least squares (Lasso) estimator (Tibshirani, 1996; Chen et al., 1998), which is defined as

$$\hat\beta^{\lambda_n} = \arg\min_{\beta} \left\{ \frac{1}{2n}\|Y - X\beta\|_{\ell_2}^2 + \lambda_n \|\beta\|_{\ell_1} \right\} \tag{1.2}$$



where $\lambda_n$ is the regularization parameter for the $\ell_1$-norm of the coefficients $\beta$, while $\hat\beta^{\lambda_n}$ denotes the Lasso solution when $\lambda_n$ is used for regularization. In the following, we will suppress the superscript if no confusion is caused. Lasso can be formulated as a quadratic programming problem and the solution can be computed efficiently (Osborne et al., 2000; Efron et al., 2004).

Its asymptotic properties for fixed dimensionality have been studied in (Fu and Knight, 2000). For the high-dimensional setting, Greenshtein and Ritov (2004) prove that the Lasso estimator is persistent, in the sense that, when constrained in a class, the predictive risk of the Lasso estimator converges in probability to the risk obtained by the oracle estimator. However, recent studies (Meinshausen and Bühlmann, 2006; Zhao and Yu, 2007; Zou, 2006) show that the Lasso estimator is not in general variable selection consistent, which means that in general the correct sparse subset of the relevant variables cannot be identified even asymptotically. In particular, in (Zhao and Yu, 2007; Wainwright et al., 2006), it is shown that in order for Lasso to be variable selection consistent, the so-called irrepresentable condition has to be satisfied. Zou (2006) proposes the adaptive Lasso and shows that by using adaptive weights for different variables, the $\ell_1$ penalty can lead to variable selection consistency. In terms of estimation, it has been shown in Meinshausen and Yu (2006) that under weaker conditions, the Lasso estimator is $\ell_2$-consistent for the high-dimensional setting where the total number of variables can grow almost as fast as $\exp(n)$. Under a stronger assumption, Bunea et al. (2007a) further prove the sparsity oracle inequalities for the Lasso estimator using fixed design, which bound the $\ell_2$-norm of the predictive error in terms of the number of non-zero components of the oracle vector. Such results can be applied to nonparametric adaptive regression estimation and to the problem of aggregation of arbitrary estimators. Parallel to the fixed design result, a similar result for the random design can be found in (Bunea et al., 2007b). A more recent result from (Bickel et al., 2007) refines similar oracle inequalities using weaker assumptions. All these results show that for sparse linear models, Lasso can overcome the curse of dimensionality even when facing increasing dimensions.

When variables are naturally grouped together, it is more meaningful to select variables at a group level instead of individual variables, as can be seen from the previous examples. A general strategy for grouped variable selection is to use block $\ell_1$-norm regularization. For variables within each block (group), an $\ell_q$ norm is applied, and different blocks are then combined by an $\ell_1$ norm (hence the name $\ell_1$-$\ell_q$ regularization). One such example is the group Lasso (Yuan and Lin, 2006), which is an extension of Lasso for grouped variables and can be viewed as an $\ell_1$-$\ell_2$ regularized regression. Other works related to grouped variable selection include the iCAP estimator (Zhao et al., 2008), which can be viewed as an $\ell_1$-$\ell_\infty$ regularized regression, and group logistic regression (Meier et al., 2007), etc. Using random design, Meier et al. (2007) proved the estimation consistency result for group Lasso with Lipschitz type loss functions. Also with random design, Bach (2007) derived an irrepresentable condition similar to that in (Zhao and Yu, 2007) and proved the variable selection consistency result for group Lasso. However, to the best of our knowledge, there are no corresponding results for estimation and variable selection consistency for the group Lasso and iCAP estimators using fixed design, nor persistency results using random design. There is also no systematic theoretical treatment for the whole family of the more general $\ell_1$-$\ell_q$ regularized regression with $1 \le q \le \infty$.

Our work tries to bridge this gap and provide a unified treatment of $\ell_1$-$\ell_q$ regularized regression for the whole range from $q = 1$ to $q = \infty$. The main conclusion of our study is that many good properties of $\ell_1$-regularization (Lasso) naturally carry over to the $\ell_1$-$\ell_q$ cases ($1 \le q \le \infty$), even if the number of variables within each group can increase with the sample size $n$. Using fixed design, when different conditions are assumed, we show that the $\ell_1$-$\ell_q$ estimator is both estimation consistent and variable selection consistent, and if the linear model assumption does not hold, sparsity oracle inequalities for the prediction error can still be obtained under a weaker condition. Using random design, we show that a constrained form of the $\ell_1$-$\ell_q$ regression estimator is persistent. Our results provide a simultaneous analysis of both the iCAP ($q = \infty$) and the group Lasso ($q = 2$) estimators. When there is no group structure, all the analysis naturally reduces to the current results of the Lasso estimator ($q = 1$). One interesting application of these results is to analyze the simultaneous Lasso estimator (Turlach et al., 2005), which can be viewed as an $\ell_1$-$\ell_\infty$ regularized regression using block designs.

The rest of the paper is organized as follows. In Section 2 we first introduce some preliminaries of the $\ell_1$-$\ell_q$ regularized regression and then describe some characteristics of its solution. In Section 3 we study the variable selection consistency result. In Section 4 we study the estimation consistency and the sparsity oracle inequalities. In Section 5 we study the persistency property. We conclude with some discussion in Section 6.



II. $\ell_1$-$\ell_q$ Regularized Regression

Given the design matrix $X$ and the response vector $Y$, the $\ell_1$-$\ell_q$ regularized regression estimator is defined as the solution of the following convex optimization problem:

$$\hat\beta^{\lambda_n} = \arg\min_{\beta} \frac{1}{2n}\|Y - X\beta\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} \tag{2.1}$$

where $\lambda_n$ is a positive number which penalizes complex models and $q'$ is the conjugate exponent of $q$, which satisfies $\frac{1}{q} + \frac{1}{q'} = 1$ (assuming $\frac{1}{\infty} = 0$). The terms $(d_j)^{1/q'}$ are used to adjust the effect of different group sizes. It is easy to see that when $q = 1$, this reduces to the standard Lasso estimator; when $q = 2$, this reduces to the group Lasso estimator (Yuan and Lin, 2006); when $q = \infty$, this reduces to the $\ell_1$-$\ell_\infty$ regularized regression estimator, or the iCAP estimator defined in (Zhao et al., 2008).
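To make the three special cases concrete, the penalty and objective can be evaluated numerically. This is a minimal sketch under the assumption that the objective is $\frac{1}{2n}\|Y - X\beta\|_{\ell_2}^2 + \lambda_n \sum_j (d_j)^{1/q'}\|\beta_j\|_{\ell_q}$; the function names and the representation of groups as index arrays are illustrative choices, not taken from the paper.

```python
import numpy as np

def l1_lq_penalty(beta, groups, q):
    """sum_j d_j^{1/q'} * ||beta_j||_q, where 1/q + 1/q' = 1 and 1/inf = 0."""
    inv_q_conj = 1.0 if np.isinf(q) else 1.0 - 1.0 / q   # this is 1/q'
    total = 0.0
    for g in groups:
        d_j = len(g)
        total += d_j ** inv_q_conj * np.linalg.norm(beta[g], ord=q)
    return total

def l1_lq_objective(X, Y, beta, groups, lam, q):
    """Least-squares loss scaled by 1/(2n) plus the weighted block penalty."""
    n = X.shape[0]
    resid = Y - X @ beta
    return resid @ resid / (2.0 * n) + lam * l1_lq_penalty(beta, groups, q)
```

With $q = 1$ the weights $(d_j)^{1/q'}$ all equal 1 and the penalty is the plain $\ell_1$ norm; with $q = 2$ each group norm is weighted by $\sqrt{d_j}$; with $q = \infty$ it is weighted by $d_j$.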



To characterize the solution to this problem, the following result can be straightforwardly obtained using the Karush-Kuhn-Tucker (KKT) optimality condition for convex optimization.



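Proposition 2.1. (KKT conditions) A vector $\hat\beta = (\hat\beta_1^T, \ldots, \hat\beta_{p_n}^T)^T \in \mathbb{R}^{m_n}$, $m_n = \sum_{j=1}^{p_n} d_j$, is an optimum of the objective function in (2.1) if and only if there exists a sequence of subgradients $\hat g_j \in \partial\|\hat\beta_j\|_{\ell_q}$, such that

$$\frac{1}{n} X_j^T (X\hat\beta - Y) + \lambda_n (d_j)^{1/q'} \hat g_j = 0. \tag{2.2}$$

The subdifferential $\partial\|\hat\beta_j\|_{\ell_q}$ is the set of vectors $\hat g_j \in \mathbb{R}^{d_j}$ satisfying the following.

If $1 < q < \infty$, then

$$\hat g_j = \partial\|\hat\beta_j\|_{\ell_q} = \begin{cases} B^{q'}(1) & \text{if } \hat\beta_j = 0 \\[4pt] \left( \dfrac{|\hat\beta_{j\ell}|^{q-1}\,\mathrm{sign}(\hat\beta_{j\ell})}{\|\hat\beta_j\|_{\ell_q}^{q-1}} \right)_{\ell=1}^{d_j} & \text{o.w.} \end{cases} \tag{2.3}$$

where $B^{q'}(1)$ denotes the ball of radius 1 in the dual norm, i.e. $1/q + 1/q' = 1$. It is easy to see that $\|\hat g_j\|_{\ell_{q'}} \le 1$ for any $j$.

If $q = \infty$ then

$$\hat g_j = \partial\|\hat\beta_j\|_{\ell_\infty} = \begin{cases} B^{1}(1) & \text{if } \hat\beta_j = 0 \\[4pt] \mathrm{conv}\{\mathrm{sign}(\hat\beta_{j\ell})\, e_\ell : |\hat\beta_{j\ell}| = \|\hat\beta_j\|_{\ell_\infty}\} & \text{o.w.} \end{cases} \tag{2.4}$$

where $\mathrm{conv}(A)$ denotes the convex hull of a set $A$ and $e_\ell$ the $\ell$-th canonical unit vector in $\mathbb{R}^{d_j}$. It is also easy to see that $\|\hat g_j\|_{\ell_{q'}} = \|\hat g_j\|_{\ell_1} \le 1$ for all $j$ when $q = \infty$.

If $q = 1$ then

$$\hat g_j = \partial\|\hat\beta_j\|_{\ell_1} = \{\xi \in \mathbb{R}^{d_j} : \xi_\ell \in \partial|\cdot|(\hat\beta_{j\ell}),\ \ell = 1, \ldots, d_j\}. \tag{2.5}$$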
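The case $1 < q < \infty$ with a nonzero group can be checked numerically. The sketch below assumes the gradient formula $\mathrm{sign}(\beta_{j\ell})|\beta_{j\ell}|^{q-1}/\|\beta_j\|_{\ell_q}^{q-1}$ from (2.3); the function name `lq_norm_gradient` and the example vector are illustrative. It confirms two standard facts: the gradient has dual norm 1, and its inner product with $\beta_j$ recovers $\|\beta_j\|_{\ell_q}$.

```python
import numpy as np

def lq_norm_gradient(beta_j, q):
    """Gradient of ||.||_q at a nonzero beta_j, valid for 1 < q < inf."""
    norm_q = np.linalg.norm(beta_j, ord=q)
    if norm_q == 0.0:
        # at the origin the subdifferential is the whole dual-norm unit ball
        raise ValueError("beta_j = 0: subdifferential is not a single vector")
    return np.sign(beta_j) * np.abs(beta_j) ** (q - 1) / norm_q ** (q - 1)

beta_j = np.array([1.0, -2.0, 0.5])
q = 3.0
q_conj = q / (q - 1.0)                         # conjugate exponent q'
g_j = lq_norm_gradient(beta_j, q)
dual_norm = np.linalg.norm(g_j, ord=q_conj)    # ||g_j||_{q'}
inner = g_j @ beta_j                           # <g_j, beta_j>
```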



From Proposition 2.1, the $\ell_1$-$\ell_q$ regularized regression estimator can be efficiently computed even with large $n$ and $p_n$. For example, blockwise coordinate descent algorithms as in (Zhao et al., 2008) can be easily applied. When $q = 1$ and $q = \infty$, due to the fact that feasible parameters are constrained to lie within a polyhedral region with parallel level curves, efficient path algorithms can be developed (Efron et al., 2004; Zhao et al., 2008). At each iteration of the blockwise coordinate descent algorithm, $\hat\beta_j$ for $j = 1, \ldots, p_n$ is updated, with the rest of the coefficients fixed. Coupled with a threshold operator, these algorithms generally converge very fast and an exact solution can be obtained. Standard optimization methods, such as interior-point methods (Boyd and Vandenberghe, 2004), can also be directly applied to solve the $\ell_1$-$\ell_q$ regularized regression problems.
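As an illustration of a blockwise update coupled with a threshold operator, here is a minimal sketch for the case $q = 2$ only. It is not the algorithm of Zhao et al. (2008): it uses a block proximal-gradient step (block soft-thresholding with a per-block step size), which is a substitute technique chosen here because it has a closed form for arbitrary $X_j$. The names `group_soft_threshold` and `group_lasso_bcd` are hypothetical.

```python
import numpy as np

def group_soft_threshold(z, t):
    """Proximal operator of t * ||.||_2: shrink z toward 0, or set it to 0."""
    norm = np.linalg.norm(z)
    if norm <= t:
        return np.zeros_like(z)
    return (1.0 - t / norm) * z

def group_lasso_bcd(X, Y, groups, lam, n_iter=500):
    """Cyclic block proximal-gradient descent for
    (1/2n)||Y - X b||_2^2 + lam * sum_j sqrt(d_j) * ||b_j||_2."""
    n, m = X.shape
    beta = np.zeros(m)
    resid = Y - X @ beta
    # per-block Lipschitz constant of the smooth part: ||X_j||_2^2 / n
    lips = [np.linalg.norm(X[:, g], 2) ** 2 / n for g in groups]
    for _ in range(n_iter):
        for g, L in zip(groups, lips):
            if L == 0.0:
                continue                          # all-zero block: nothing to fit
            grad = -X[:, g].T @ resid / n         # block gradient of the loss
            z = beta[g] - grad / L                # gradient step on block g
            new = group_soft_threshold(z, lam * np.sqrt(len(g)) / L)
            resid -= X[:, g] @ (new - beta[g])    # keep the residual in sync
            beta[g] = new
    return beta
```

Groups whose update falls below the threshold are set exactly to zero, which is what produces group-level sparsity. At a fixed point, each block satisfies the KKT condition (2.2) with $q = 2$.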



It is well-known (Osborne et al., 2000) that under some conditions, the Lasso can select at most $n$ nonzero variables even in the case $p_n \gg n$. A similar but weaker result can be obtained for the $\ell_1$-$\ell_q$ regularized regression.

Proposition 2.2. For the $\ell_1$-$\ell_q$ regularized regression problem defined in equation (2.1) with $\lambda_n > 0$, there exists a solution $\hat\beta^{\lambda_n}$ such that the number of nonzero groups $|S(\hat\beta)|$ is upper bounded by $n$, the number of given data points, where $S(\hat\beta) = \{j : \hat\beta_j \neq 0\}$.



Remark 2.3. Notice that the solution to the $\ell_1$-$\ell_q$ regularized regression problem may not be unique, especially when $p_n \gg n$ (similar to the Lasso case), since the optimization problem might not be strictly convex. Consequently, there might exist other solutions that contain more than $n$ active groups. However, a compact solution $\hat\beta$ with $|S(\hat\beta)| \le n$ can always be obtained by following an easy and mechanical step described in the proof of Proposition 2.2.

Proof: From the KKT condition in Proposition 2.1 we know that any solution $\hat\beta$ should satisfy the following conditions ($j = 1, \ldots, p_n$):

$$\frac{1}{n} X_j^T (Y - X\hat\beta) = \lambda_n (d_j)^{1/q'} \hat g_j$$

where $\hat g_j \in \partial\|\hat\beta_j\|_{\ell_q}$. Now suppose there is a solution $\hat\beta$ which has $s = |S(\hat\beta)| > n$ active groups. In the following we will show that we can always construct another solution $\tilde\beta$ with one less active group, i.e. $|S(\tilde\beta)| = |S(\hat\beta)| - 1$.

Without loss of generality assume that the first $s$ groups of variables in $\hat\beta$ are active, i.e. $\hat\beta_j \neq 0$ for $j = 1, \ldots, s$. Since

$$X\hat\beta = \sum_{j=1}^{s} X_j \hat\beta_j \in \mathbb{R}^{n \times 1}$$

and $s > n$, the set of vectors $X_1\hat\beta_1, \ldots, X_s\hat\beta_s$ is linearly dependent. Without loss of generality assume

$$X_1\hat\beta_1 = \alpha_2 X_2\hat\beta_2 + \ldots + \alpha_s X_s\hat\beta_s.$$

Now define $\tilde\beta_j = 0$ for $j = 1$ and $j > s$, and $\tilde\beta_j = (1 + \alpha_j)\hat\beta_j$ for $j = 2, \ldots, s$. It is straightforward to check that $\tilde\beta$ satisfies the KKT condition and thus is also a solution to the $\ell_1$-$\ell_q$ regularized regression problem in equation (2.1). The result thus follows by induction. □

The main objective of the paper is to investigate several important statistical properties of the $\ell_1$-$\ell_q$ estimator $\hat\beta$. We first give some rough definitions of the properties that we would like to establish; more details are given in the corresponding sections.
Definition 2.4. (Variable selection consistency) An estimator is said to be variable selection consistent if it can correctly recover the sparsity pattern with probability going to 1. For the case of grouped variable selection, $\hat\beta$ is said to be variable selection consistent if

$$P\left(S(\hat\beta) = S(\beta^*)\right) \to 1. \tag{2.6}$$

Definition 2.5. ($\ell_1$-estimation consistency) An estimator is said to be $\ell_1$-estimation consistent if the $\ell_1$-norm of the difference between the estimator and the true parameter vector converges to 0 in probability, i.e.

$$\forall \delta > 0 \quad P\left(\|\hat\beta - \beta^*\|_{\ell_1} > \delta\right) \to 0. \tag{2.7}$$



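Definition 2.6. (Prediction error consistency) An estimator is said to be prediction error consistent if the prediction error of the estimator, defined as $\frac{1}{n}\|\hat Y - X\beta^*\|_{\ell_2}^2$ with $\hat Y = X\hat\beta$, converges to 0 in probability, i.e.

$$\forall \delta > 0 \quad P\left(\frac{1}{n}\|\hat Y - X\beta^*\|_{\ell_2}^2 > \delta\right) \to 0. \tag{2.8}$$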



Definition 2.7. (Risk consistency or Persistency) Suppose the true model $f^*(X)$ does not have to be linear. For the regression model with random design, let $(X, Y) \sim F_n \in \mathcal{F}_n$, where $\mathcal{F}_n$ is a collection of distributions of i.i.d. $(m_n + 1)$-dimensional random vectors. Define the risk function under the distribution $F_n$ to be $R_{F_n}(\beta)$ (more details in Section 5). Given a sequence of sets of predictors $B_n$, the sequence of estimators $\hat\beta_{F_n} \in B_n$ is called persistent if for every sequence $F_n \in \mathcal{F}_n$,

$$R_{F_n}(\hat\beta_{F_n}) - R_{F_n}(\beta^*_{F_n}) \xrightarrow{P} 0, \tag{2.9}$$

where

$$\beta^*_{F_n} = \arg\min_{\beta \in B_n} R_{F_n}(\beta). \tag{2.10}$$



For the $\ell_1$-$\ell_q$ regularized regression, we will later use $B_n = \{\beta : \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} \le L_n\}$, for some $L_n = o\left((n / \log n)^{1/4}\right)$.
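The quantities appearing in Definitions 2.4 to 2.6 are easy to evaluate on a fitted coefficient vector. The helpers below are a minimal sketch with hypothetical names; groups are assumed to be given as arrays of column indices, and `tol` is an illustrative tolerance for deciding whether a group is numerically zero.

```python
import numpy as np

def group_support(beta, groups, tol=0.0):
    """S(beta): indices j whose subvector beta_j has a nonzero entry."""
    return {j for j, g in enumerate(groups) if np.max(np.abs(beta[g])) > tol}

def l1_estimation_error(beta_hat, beta_star):
    """||beta_hat - beta_star||_1, the quantity in Definition 2.5."""
    return np.sum(np.abs(beta_hat - beta_star))

def prediction_error(X, beta_hat, beta_star):
    """(1/n) * ||X beta_hat - X beta_star||_2^2, the quantity in Definition 2.6."""
    n = X.shape[0]
    diff = X @ (beta_hat - beta_star)
    return diff @ diff / n
```

Variable selection consistency then corresponds to `group_support(beta_hat, groups) == group_support(beta_star, groups)` holding with probability tending to one over repeated samples.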

The following table gives a high level summary of our main results, ordered from very 
stringent assumptions to much weaker assumptions: 
Variable selection consistency: $P\left(S(\hat\beta) = S(\beta^*)\right) \to 1$ (R1)

$\ell_1$-estimation convergence rate: $\|\hat\beta - \beta^*\|_{\ell_1} = O_P\left(s_n d_n \sqrt{\dfrac{\log m_n}{n}}\right)$ (R2)

Prediction error convergence rate: $\dfrac{1}{n}\|\hat Y - X\beta^*\|_{\ell_2}^2 = O_P\left(\dfrac{s_n d_n \log m_n}{n}\right)$ (R3)

Prediction (misspecified model): $\dfrac{1}{n}\|\hat Y - f^*\|_{\ell_2}^2 = O_P\left(\dfrac{s_n d_n \log m_n}{n}\right)$ (R3*)

Persistency (misspecified model): $R_{F_n}(\hat\beta_{F_n}) - R_{F_n}(\beta^*_{F_n}) \xrightarrow{P} 0$ (R4)



Remark 2.8. (R1) to (R3) assume the true model is linear, while (R3*) and (R4) relax this condition so that the model can be misspecified. Even though (R3) and (R3*) look very similar, (R3*) drops the linear model assumption at the price of enforcing another "weak sparsity" condition. Also, (R1), (R2), (R3), and (R3*) are fixed design results, while (R4) is a random design result.

In general, the condition for variable selection consistency is the strongest since it involves not only certain relations among $n$, $\lambda_n$, $p_n$, $s_n$, $d_n$, but also the minimum absolute value of the parameters, $\rho_n^* = \min_{j \in S} \|\beta_j^*\|_{\ell_\infty}$. The $\ell_1$-estimation consistency and prediction error consistency require weaker conditions than variable selection consistency. Unlike the previous properties, when the model is misspecified, the prediction error consistency in (R3*) follows from a sparsity oracle inequality. Neither the sparsity oracle inequalities nor persistency requires the existence of a true linear model, so both are more general. In particular, persistency concerns the consistency of the predictive risk under random design and only needs a very weak assumption about the design.



III. Variable Selection Consistency 



In this section we study the conditions under which the $\ell_1$-$\ell_q$ estimator is variable selection consistent. Our proof is adapted from (Wainwright, 2006) and (Ravikumar et al., 2007). The former paper develops the "witness" proof idea, which is the main framework used in our proof. The latter paper mainly treats variable selection consistency when $q = 2$ in a nonparametric sparse additive model setting, which makes their conditions more stringent than ours even when $q = 2$.

In the following, let $S$ denote the true set of group indices $\{j : \beta_j^* \neq 0\}$, with $s_n = |S|$, and let $S^c$ denote its complement. Denote by $\Lambda_{\min}(C)$ the minimum eigenvalue of the matrix $C$. Then, we have

Theorem 3.1. Let $q$ and $q'$ be conjugate exponents of each other, that is, $\frac{1}{q} + \frac{1}{q'} = 1$ and $1 \le q, q' \le \infty$. Suppose that the following conditions hold on the design matrix $X$:

$$\Lambda_{\min}\left(\frac{1}{n} X_S^T X_S\right) \ge C_{\min} > 0,$$

$$\max_{j \in S^c} \left\| (X_j^T X_S)(X_S^T X_S)^{-1} \right\| \le 1 - \delta, \quad \text{for some } 0 < \delta \le 1, \tag{3.1}$$

where $\|\cdot\|_{a,b}$ is the matrix norm, defined as $\|A\|_{a,b} = \sup_x \dfrac{\|Ax\|_{\ell_b}}{\|x\|_{\ell_a}}$, $1 \le a, b \le \infty$. Assume the maximum number of variables within each group satisfies $d_n \to \infty$ and $d_n = o(n)$. Furthermore, suppose the following conditions hold, which relate the regularization parameter $\lambda_n$ to the design parameters $n$, $p_n$, the number of relevant groups $s_n$ and the maximum group size $d_n$:

$$\frac{\lambda_n^2 n}{\log((p_n - s_n) d_n)} \to \infty, \tag{3.2}$$

$$\frac{1}{\rho_n^*}\left\{ \sqrt{\frac{\log(s_n d_n)}{n}} + \lambda_n (d_n)^{1/q'} \left\| \left(\frac{1}{n} X_S^T X_S\right)^{-1} \right\|_{\infty,\infty} \right\} \to 0, \tag{3.3}$$

where $\rho_n^* = \min_{j \in S} \|\beta_j^*\|_{\ell_\infty}$. Then, the $\ell_1$-$\ell_q$ regularized regression is variable selection consistent.



Remark 3.2. First, notice that the result established in Theorem 3.1 is a direct generalization of the variable selection result for Lasso in (Wainwright, 2006), obtained by setting $q = 1$ and $d_n = 1$ (as then the $\ell_1$-$\ell_q$ estimator degenerates to Lasso). This gives sufficient conditions for exact recovery of the sparsity pattern in $\beta^*$ for the $\ell_1$-$\ell_q$ regularized regression. Also notice that when $d_n$ is bounded from above, the conditions are almost the same as those of Lasso except the condition in equation (3.1), which depends on the value of $q$.

Second, we consider the case when $\rho_n^*$ is bounded away from zero. Assuming that $q = \infty$ and $d_n = n^{1/5}$ (such as in the fitting of an additive model with basis expansion), we must have $\lambda_n = o(n^{-1/5})$, and as a result of $\dfrac{\lambda_n^2 n}{\log((p_n - s_n) d_n)} \to \infty$, we need to have $p_n = o(\exp(n^{3/5}))$. This means that even when we have increasing group size $d_n$, the sparsity pattern (in terms of grouped variables) can still be correctly identified with a large $p_n$.

Finally, when the minimum parameter value $\rho_n^* \to 0$, to ensure variable selection consistency, it can at most converge to zero at a rate slower than $n^{-1/2}$.
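The rate in the second point of the remark can be traced step by step. The following is a sketch of the arithmetic, assuming the two conditions of the theorem take the form $\lambda_n^2 n / \log((p_n - s_n) d_n) \to \infty$ and $\lambda_n (d_n)^{1/q'} \to 0$ once $\rho_n^*$ is bounded below, and assuming additionally (for illustration only) that the norm of $(\frac{1}{n} X_S^T X_S)^{-1}$ stays bounded:

```latex
\begin{align*}
q = \infty \;\Rightarrow\; q' = 1, \qquad
\lambda_n (d_n)^{1/q'} = \lambda_n\, n^{1/5} \to 0
  &\;\Longrightarrow\; \lambda_n = o\!\left(n^{-1/5}\right), \\
\lambda_n^2\, n = o\!\left(n^{-2/5} \cdot n\right) = o\!\left(n^{3/5}\right), \qquad
\frac{\lambda_n^2\, n}{\log((p_n - s_n) d_n)} \to \infty
  &\;\Longrightarrow\; \log((p_n - s_n) d_n) = o\!\left(n^{3/5}\right) \\
  &\;\Longrightarrow\; p_n = o\!\left(\exp\!\left(n^{3/5}\right)\right).
\end{align*}
```

The last implication uses $s_n \ll p_n$ and the fact that $\log d_n = \frac{1}{5}\log n$ is negligible relative to $n^{3/5}$.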



Proof: Note that the special case $q = 1$ has already been proved in (Wainwright et al., 2006). Here, we only consider the case $1 < q \le \infty$. A vector $\hat\beta \in \mathbb{R}^{m_n}$, $m_n = \sum_{j=1}^{p_n} d_j$, is an optimum of the objective function in (2.1) if and only if there exists a sequence of subgradients $\hat g_j \in \partial\|\hat\beta_j\|_{\ell_q}$, such that

$$\frac{1}{n} X_j^T \left( \sum_{k=1}^{p_n} X_k \hat\beta_k - Y \right) + \lambda_n (d_j)^{1/q'} \hat g_j = 0. \tag{3.4}$$

The subdifferential $\partial\|\hat\beta_j\|_{\ell_q}$ satisfies the KKT conditions in Proposition 2.1.

Our argument closely follows the approach of Wainwright et al. (2006) in the linear case. In particular, we proceed by a "witness" proof technique, to show the existence of a coefficient-subgradient pair $(\hat\beta, \hat g)$ for which $\mathrm{supp}(\hat\beta) = \mathrm{supp}(\beta^*)$. To do so, we first set $\hat\beta_{S^c} = 0$ and let $\hat g_S$ be the vector concatenating all the subvectors $\hat g_j$, for $j \in S$. We also define $\hat g_{S^c}$ and $\hat\beta_S$ in a similar way. We then obtain $\hat\beta_S$ and $\hat g_{S^c}$ from the stationary conditions in (3.4). By showing that, with high probability,

$$\hat\beta_j \neq 0 \quad \text{for } j \in S \tag{3.5}$$

$$\hat g_j \in B^{q'}(1) \quad \text{for } j \in S^c, \tag{3.6}$$

this demonstrates that with high probability there exists an optimal solution to the optimization problem in (2.1) that has the same sparsity pattern as the true model.



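Setting $\hat\beta_{S^c} = 0$ and

$$\hat g_j = \begin{cases} \left( \dfrac{|\hat\beta_{j\ell}|^{q-1}\,\mathrm{sign}(\hat\beta_{j\ell})}{\|\hat\beta_j\|_{\ell_q}^{q-1}} \right)_{\ell=1}^{d_j} & 1 < q < \infty \\[6pt] \mathrm{conv}\{\mathrm{sign}(\hat\beta_{j\ell})\, e_\ell : |\hat\beta_{j\ell}| = \|\hat\beta_j\|_{\ell_\infty}\} & q = \infty \end{cases} \tag{3.7}$$

for $j \in S$, denote $W = \mathrm{diag}\left((d_1)^{1/q'} I_{d_1}, \ldots, (d_{p_n})^{1/q'} I_{d_{p_n}}\right)$, where $I_{d_j}$ is a $d_j \times d_j$ identity matrix. We define $W_S$ to be the submatrix of $W$ obtained by extracting the rows and columns corresponding to the group index set $S$. The stationary condition for $\hat\beta_S$ is

$$\frac{1}{n} X_S^T (X_S \hat\beta_S - Y) + \lambda_n W_S \hat g_S = 0. \tag{3.8}$$

Let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$; then the stationary condition can be written as

$$\frac{1}{n} X_S^T X_S (\hat\beta_S - \beta_S^*) - \frac{1}{n} X_S^T \epsilon + \lambda_n W_S \hat g_S = 0 \tag{3.9}$$

or

$$\hat\beta_S - \beta_S^* = \left( \frac{1}{n} X_S^T X_S \right)^{-1} \left( \frac{1}{n} X_S^T \epsilon - \lambda_n W_S \hat g_S \right) \tag{3.10}$$

assuming that $\frac{1}{n} X_S^T X_S$ is nonsingular. Recalling our definition

$$\rho_n^* = \min_{j \in S} \|\beta_j^*\|_{\ell_\infty} > 0, \tag{3.11}$$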
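it suffices to show that

$$\|\hat\beta_S - \beta_S^*\|_{\ell_\infty} \le \frac{\rho_n^*}{2} \tag{3.12}$$

in order to ensure that $\mathrm{supp}(\beta_S^*) = \mathrm{supp}(\hat\beta_S) = \{j : \|\hat\beta_j\|_{\ell_\infty} \neq 0\}$.

Using $\Sigma_{SS} = \frac{1}{n} X_S^T X_S$ to simplify notation, we have the bound

$$\|\hat\beta_S - \beta_S^*\|_{\ell_\infty} \le \left\| \Sigma_{SS}^{-1} \frac{1}{n} X_S^T \epsilon \right\|_{\ell_\infty} + \lambda_n \left\| \Sigma_{SS}^{-1} W_S \hat g_S \right\|_{\ell_\infty}. \tag{3.13}$$

We now proceed to bound the quantities above. First note that for $j \in S$, $\|\hat g_j\|_{\ell_{q'}} = 1$. Therefore, since

$$\|\hat g_S\|_{\ell_\infty} = \max_{j \in S} \|\hat g_j\|_{\ell_\infty} \le \max_{j \in S} \|\hat g_j\|_{\ell_{q'}} = 1 \tag{3.14}$$

we have that

$$\left\| \Sigma_{SS}^{-1} W_S \hat g_S \right\|_{\ell_\infty} \le (d_n)^{1/q'} \left\| \Sigma_{SS}^{-1} \right\|_{\infty,\infty}. \tag{3.15}$$

Therefore

$$\|\hat\beta_S - \beta_S^*\|_{\ell_\infty} \le \left\| \Sigma_{SS}^{-1} \frac{1}{n} X_S^T \epsilon \right\|_{\ell_\infty} + \lambda_n (d_n)^{1/q'} \left\| \Sigma_{SS}^{-1} \right\|_{\infty,\infty}.$$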



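Finally, consider $Z = \Sigma_{SS}^{-1} \frac{1}{n} X_S^T \epsilon$. Note that $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, so that $Z$ is Gaussian as well, with mean zero. Consider its $\ell$-th component, $Z_\ell = e_\ell^T Z$. Then $E[Z_\ell] = 0$, and

$$\mathrm{Var}(Z_\ell) = \frac{\sigma^2}{n} e_\ell^T \Sigma_{SS}^{-1} e_\ell \le \frac{\sigma^2}{n C_{\min}}. \tag{3.16}$$

By the comparison results on Gaussian maxima (Ledoux and Talagrand, 1991), we then have that

$$E\left[\|Z\|_{\ell_\infty}\right] \le 3\sqrt{\log(s d_n)}\, \max_\ell \sqrt{\mathrm{Var}(Z_\ell)} \le 3\sigma \sqrt{\frac{\log(s d_n)}{n C_{\min}}}.$$

An application of Markov's inequality then gives that

$$P\left( \|\hat\beta_S - \beta_S^*\|_{\ell_\infty} > \frac{\rho_n^*}{2} \right) \le P\left( \|Z\|_{\ell_\infty} + \lambda_n (d_n)^{1/q'} \|\Sigma_{SS}^{-1}\|_{\infty,\infty} > \frac{\rho_n^*}{2} \right) \tag{3.17}$$

$$\le \frac{2}{\rho_n^*} \left\{ E\left[\|Z\|_{\ell_\infty}\right] + \lambda_n (d_n)^{1/q'} \|\Sigma_{SS}^{-1}\|_{\infty,\infty} \right\} \tag{3.18}$$

$$\le \frac{2}{\rho_n^*} \left\{ 3\sigma \sqrt{\frac{\log(s d_n)}{n C_{\min}}} + \lambda_n (d_n)^{1/q'} \|\Sigma_{SS}^{-1}\|_{\infty,\infty} \right\} \tag{3.19}$$

which converges to zero under the condition that

$$\frac{1}{\rho_n^*} \left\{ \sqrt{\frac{\log(s d_n)}{n}} + \lambda_n (d_n)^{1/q'} \|\Sigma_{SS}^{-1}\|_{\infty,\infty} \right\} \to 0. \tag{3.20}$$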
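We now analyze $\hat g_{S^c}$. Recall that we have set $\hat\beta_{S^c} = \beta_{S^c}^* = 0$. The stationary condition for $j \in S^c$ is thus given by

$$\frac{1}{n} X_j^T \left( X_S \hat\beta_S - X_S \beta_S^* - \epsilon \right) + \lambda_n (d_j)^{1/q'} \hat g_j = 0. \tag{3.21}$$

Therefore, writing $\Sigma_{jS} = \frac{1}{n} X_j^T X_S$,

$$\hat g_j = \frac{1}{\lambda_n (d_j)^{1/q'}} \left\{ \Sigma_{jS} \Sigma_{SS}^{-1} \left( \lambda_n W_S \hat g_S - \frac{1}{n} X_S^T \epsilon \right) + \frac{1}{n} X_j^T \epsilon \right\} \tag{3.22}$$

from equation (3.10). We want to show that

$$\hat g_j \in B^{q'}(1) \tag{3.23}$$

for all $j \in S^c$. From (3.22), we see that $\hat g_j$ is Gaussian, with mean

$$\mu_j = E(\hat g_j) = (d_j)^{-1/q'} \Sigma_{jS} \Sigma_{SS}^{-1} W_S \hat g_S. \tag{3.24}$$

We then obtain the bound

$$\|\mu_j\|_{\ell_{q'}} \le \left\| \Sigma_{jS} \Sigma_{SS}^{-1} \right\| \le 1 - \delta \quad \text{for some } \delta > 0,$$

where the matrix norm is the one appearing in condition (3.1). It therefore suffices to show that

$$P\left( \max_{j \in S^c} (d_j)^{1/q'} \|\hat g_j - \mu_j\|_{\ell_\infty} > \frac{\delta}{2} \right) \to 0 \tag{3.25}$$

since this implies that

$$\|\hat g_j\|_{\ell_{q'}} \le \|\mu_j\|_{\ell_{q'}} + \|\hat g_j - \mu_j\|_{\ell_{q'}} \tag{3.26}$$

$$\le \|\mu_j\|_{\ell_{q'}} + (d_j)^{1/q'} \|\hat g_j - \mu_j\|_{\ell_\infty} \tag{3.27}$$

$$\le (1 - \delta) + \frac{\delta}{2} + o(1) \tag{3.28}$$

with probability approaching one. To show (3.25), we again appeal to comparison results of Gaussian maxima. Define

$$Z_j = \lambda_n (d_j)^{1/q'} (\hat g_j - \mu_j) = \frac{1}{n} X_j^T \left( I - X_S (X_S^T X_S)^{-1} X_S^T \right) \epsilon \tag{3.29}$$

for $j \in S^c$. Then the $Z_j$ are zero mean Gaussian random vectors, and we need to show that

$$P\left( \max_{j \in S^c} \|Z_j\|_{\ell_\infty} > \frac{\lambda_n \delta}{2} \right) \to 0. \tag{3.30}$$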
Let $Z_{jk}$ represent the $k$-th element of $Z_j$ for $j \in S^c$. A calculation shows that $E(Z_{jk}^2) \le \dfrac{\sigma^2}{n}$. Therefore, we have by Markov's inequality and the comparison results of Gaussian maxima that

$$P\left( \max_{j \in S^c} \|Z_j\|_{\ell_\infty} > \frac{\lambda_n \delta}{2} \right) \le \frac{2}{\delta \lambda_n} E\left( \max_{j,k} |Z_{jk}| \right) \le \frac{2}{\delta \lambda_n} \left[ 3 \sqrt{\log((p_n - s_n) d_n)}\, \max_{j,k} \sqrt{E(Z_{jk}^2)} \right] \tag{3.31}$$

$$\le \frac{6\sigma}{\delta \lambda_n} \sqrt{\frac{\log((p_n - s_n) d_n)}{n}} \tag{3.32}$$

which converges to zero under the condition that

$$\frac{\lambda_n^2 n}{\log((p_n - s_n) d_n)} \to \infty. \tag{3.33}$$

This is just the condition in the statement of the theorem. □



IV. Estimation Consistency 

In this section, we prove the estimation consistency results under two types of assumptions:

(i) When the model is correctly specified, i.e. the true model is linear, we can achieve $\ell_1$-consistency and derive the optimal rate of convergence for the prediction error.

(ii) When the model is misspecified, i.e. the true model is not linear, we can still achieve a sparsity oracle inequality, which bounds the prediction error by the loss of the prediction oracle plus a term involving the number of nonzero groups of that oracle. Under the "weak sparsity" condition, we can still obtain a rate of convergence of the prediction error which is similar to the convergence rate obtained under the linear model assumption.



We begin with a technical lemma, which is essentially Lemma 1 in (Bunea et al., 2007a) and (Bickel et al., 2007), but needs to be extended to handle the group structure in the more general $\ell_1$-$\ell_q$ regularized regression setting.

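Lemma 4.1. Let $\epsilon_1, \ldots, \epsilon_n$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$ and let $\hat Y = X\hat\beta$ be the $\ell_1$-$\ell_q$ regularized regression estimator with $1 \le q \le \infty$ as in (2.1) with

$$\lambda_n = A\sigma \sqrt{\frac{\log m_n}{n}} \tag{4.1}$$

for some $A > 2\sqrt{2}$. Then, for all $m_n \ge 2$, $n \ge 1$, with probability at least $1 - m_n^{1 - A^2/8}$ we have simultaneously for all $\beta \in \mathbb{R}^{m_n}$:

$$\frac{1}{n}\|\hat Y - X\beta^*\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} \le \frac{1}{n}\|X\beta - X\beta^*\|_{\ell_2}^2 + 4 \sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} \tag{4.2}$$

where $S(\beta)$ denotes the set of nonzero group indices of $\beta$.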
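Proof: By the definition of $\hat Y = X\hat\beta$, we have

$$\frac{1}{n}\|Y - X\hat\beta\|_{\ell_2}^2 + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j\|_{\ell_q} \le \frac{1}{n}\|Y - X\beta\|_{\ell_2}^2 + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q}$$

for all $\beta \in \mathbb{R}^{m_n}$, $m_n = \sum_{j=1}^{p_n} d_j$, which we may rewrite as

$$\frac{1}{n}\|X\beta^* - X\hat\beta\|_{\ell_2}^2 + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j\|_{\ell_q} \le \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} + \frac{2}{n}\epsilon^T X(\hat\beta - \beta). \tag{4.3}$$

For each $j = 1, \ldots, m_n$, we define the random variables $V_j = \frac{1}{n} x_j^T \epsilon$, where $x_j$ is the $j$-th column of $X$, and the event

$$\mathcal{A} = \bigcap_{j=1}^{m_n} \{ 2|V_j| \le \lambda_n \}.$$

Under the normality assumption, we have that

$$\sqrt{n}\, V_j \sim \mathcal{N}(0, \sigma^2), \quad j = 1, \ldots, m_n. \tag{4.4}$$

Using the elementary bound on the tails of the Gaussian distribution we find that the probability of the complementary event $\mathcal{A}^c$ satisfies

$$P\{\mathcal{A}^c\} \le \sum_{j=1}^{m_n} P\{\sqrt{n}\,|V_j| > \sqrt{n}\lambda_n/2\} \le m_n P\{|Z| > \sqrt{n}\lambda_n/(2\sigma)\} \tag{4.5}$$

$$\le m_n \exp\left( -\frac{n\lambda_n^2}{8\sigma^2} \right) = m_n \exp\left( -\frac{A^2 \log m_n}{8} \right) = m_n^{1 - A^2/8} \tag{4.6}$$

where $Z \sim \mathcal{N}(0, 1)$. Then, on the set $\mathcal{A}$, we have

$$\frac{2}{n}\epsilon^T X(\hat\beta - \beta) = 2\sum_{j=1}^{m_n} V_j(\hat\beta_j - \beta_j) \le \sum_{j=1}^{m_n} \lambda_n |\hat\beta_j - \beta_j| \le \sum_{j=1}^{p_n} \lambda_n (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}$$

and therefore, still on the set $\mathcal{A}$,

$$\frac{1}{n}\|X\beta^* - X\hat\beta\|_{\ell_2}^2 \le \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} - 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j\|_{\ell_q}. \tag{4.7}$$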
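Adding the same term $\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}$ on both sides, we obtain

$$\frac{1}{n}\|X\beta^* - X\hat\beta\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} \le \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} + 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} - 2\lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j\|_{\ell_q}. \tag{4.8}$$

Recall that $S(\beta)$ is the set of non-zero group indices of $\beta$. Rewriting the right-hand side of the previous display, on the set $\mathcal{A}$ we have

$$\frac{1}{n}\|X\beta^* - X\hat\beta\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}$$

$$\le \frac{1}{n}\|X\beta^* - X\beta\|_{\ell_2}^2 + 2\sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} + 2\left( \sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\beta_j\|_{\ell_q} - \sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\hat\beta_j\|_{\ell_q} \right)$$

$$\le \frac{1}{n}\|X\beta - X\beta^*\|_{\ell_2}^2 + 4 \sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}$$

by the triangle inequality and the fact that $\beta_j = 0$ for $j \notin S(\beta)$. □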
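The role of the constant $A$ in the lemma can be seen numerically. The sketch below assumes the regularization level reads $\lambda_n = A\sigma\sqrt{\log m_n / n}$ and that the event $\mathcal{A}$ has probability at least $1 - m_n^{1 - A^2/8}$; the function names are illustrative.

```python
import math

def lambda_n(A, sigma, m_n, n):
    """Regularization level A * sigma * sqrt(log(m_n) / n)."""
    return A * sigma * math.sqrt(math.log(m_n) / n)

def event_probability_lower_bound(A, m_n):
    """Lower bound 1 - m_n^(1 - A^2/8) on the probability of the event
    on which the inequality of the lemma holds."""
    return 1.0 - m_n ** (1.0 - A ** 2 / 8.0)
```

The bound is informative only when $A > 2\sqrt{2} \approx 2.83$, since then the exponent $1 - A^2/8$ is negative and the bound tends to one as $m_n$ grows. For example, with $A = 4$ and $m_n = 100$ the exponent is $-1$, so the bound is $1 - 1/100$.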
A. Estimation Consistency Under the Linear Model Assumption

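Assuming the true model is linear, to obtain the $\ell_1$-consistency result, a key assumption on the design matrix is needed, which is stated as follows.

Assumption 1. Recall that $s_n = |S(\beta^*)|$. Assume that, with $\gamma$ ranging over vectors in $\mathbb{R}^{m_n}$,

$$\kappa = \min_{S_0 \subset \{1, \ldots, p_n\} : |S_0| \le s_n} \ \min_{\gamma :\ \sum_{j \in S_0^c} (d_j)^{1/q'} \|\gamma_j\|_{\ell_q} \le 3 \sum_{j \in S_0} (d_j)^{1/q'} \|\gamma_j\|_{\ell_q}} \frac{\|X\gamma\|_{\ell_2}}{\sqrt{n} \sqrt{\sum_{j \in S_0} (d_j)^{2/q' - 1} \|\gamma_j\|_{\ell_q}^2}} > 0. \tag{4.9}$$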

Remark 4.2. Before proving the following theorem, we pause to make some comments 
about this assumption. 

First, For q = 1 ( thus, q' = 00 ) , this assumption is very similar to the restricted eigenvalue 



assumption as in (IBickel et all 120071 ). which is defined as 



K= min _ min Jj^A > 0. (4.10) 

5oC{l,.., P }:|5o|< S „ E jeS g ll7jlk<3E i€So MUi y/nJJ2 je S H\\t 2 
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However, our assumption is slightly weaker, due to the fact that, for any 7 e 



I < djHWl- (4-11) 



Second, the quantity $\sqrt{\sum_{j \in S_0} (d_j)^{2/q'-1} \|\gamma_j\|_{\ell_q}^2}$ in our assumption balances between $q = 1$ and
$q = \infty$. For example, when $q = 1$, $\|\gamma_j\|_{\ell_1}$ is relatively large, but $(d_j)^{2/q'-1} = (d_j)^{-1}$ is very
small. For $q = \infty$, $\|\gamma_j\|_{\ell_q} = \|\gamma_j\|_{\ell_\infty}$ is relatively small, but $(d_j)^{2/q'-1} = (d_j)^{1}$ is
large. In this sense, $q = 2$ seems the most balanced one, due to the fact that

$$
\sum_{j \in S_0} \|\gamma_j\|_{\ell_1} \le \sum_{j \in S_0} \sqrt{d_j}\, \|\gamma_j\|_{\ell_2} \le \sum_{j \in S_0} d_j \|\gamma_j\|_{\ell_\infty}.
\tag{4.12}
$$

Therefore, among $q = 1, 2, \infty$, $q = 2$ needs the weakest assumption. This provides some
insight into why the group Lasso might also be a suitable choice for grouped variable selection.
However, we need to be more cautious in saying which value of $q$ is the best, since in real
applications the choice of $q$ might depend on the true relevant coefficients $\beta^*_S$. If different
components in the relevant groups are on the same order of magnitude, $q = \infty$ might be
more suitable. On the contrary, if some relevant coefficients are very small relative to the
others, $q = 1$ might be better. We plan to investigate this issue in a separate paper.
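The trade-off described in this remark can be checked numerically. The following is a minimal sketch, not part of the original analysis: it computes the weighted group norm $(d_j)^{1/q'}\|\gamma_j\|_{\ell_q}$ for $q = 1, 2, \infty$ on one made-up group and confirms the ordering between the three cases. The function names and the example vector are hypothetical.

```python
import math

def lq_norm(v, q):
    """l_q norm of a list of floats; q may be math.inf."""
    if q == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** q for x in v) ** (1.0 / q)

def conjugate(q):
    """Conjugate exponent q' defined by 1/q + 1/q' = 1."""
    if q == 1:
        return math.inf
    if q == math.inf:
        return 1.0
    return q / (q - 1.0)

def weighted_group_norm(group, q):
    """(d_j)^(1/q') * ||gamma_j||_q for a single group of size d_j."""
    d_j = len(group)
    return d_j ** (1.0 / conjugate(q)) * lq_norm(group, q)

# Hypothetical group of d_j = 4 coefficients (illustration only).
gamma_j = [3.0, -1.0, 0.5, 2.0]

w1 = weighted_group_norm(gamma_j, 1)           # ||gamma_j||_1
w2 = weighted_group_norm(gamma_j, 2)           # sqrt(d_j) * ||gamma_j||_2
winf = weighted_group_norm(gamma_j, math.inf)  # d_j * ||gamma_j||_inf

print(w1, w2, winf)
```

The three values are non-decreasing in $q$ for any group, which is the elementary norm inequality behind the comparison of $q = 1$, $q = 2$ and $q = \infty$.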

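Theorem 4.3. (Estimation consistency under linear model assumptions) Under assumption 1,
let $\epsilon_1, \ldots, \epsilon_n$ be independent $N(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Consider the
$\ell_1$-$\ell_q$ regularized estimator defined by (2.1) with

$$
\lambda_n = A\sigma\sqrt{\frac{\log m_n}{n}}
\tag{4.13}
$$

for some $A > 2\sqrt{2}$. Then, for all $n \ge 1$, with probability at least $1 - m_n^{1 - A^2/8}$ we have

$$
\frac{1}{n}\|X\hat\beta - X\beta^*\|_{\ell_2}^2 \le \frac{9A^2\sigma^2}{\kappa^2}\, \frac{s_n d_n \log m_n}{n}
\tag{4.14}
$$

$$
\|\hat\beta - \beta^*\|_{\ell_1} \le \frac{12A\sigma}{\kappa^2}\, s_n d_n \sqrt{\frac{\log m_n}{n}}.
\tag{4.15}
$$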
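To get a feel for the scaling in this theorem, the sketch below evaluates the regularization level and the two bounds for given problem sizes. It assumes the bounds take the form $\frac{9A^2\sigma^2}{\kappa^2}\frac{s_n d_n \log m_n}{n}$ and $\frac{12A\sigma}{\kappa^2} s_n d_n \sqrt{\log m_n / n}$, which is how the final steps of the proof come out. The function name and all numerical inputs are hypothetical.

```python
import math

def theorem_4_3_quantities(n, m_n, s_n, d_n, sigma, kappa, A=3.0):
    """Evaluate lambda_n and the bounds of Theorem 4.3 (as read above).

    Assumed forms (illustration only):
      lambda_n   = A * sigma * sqrt(log(m_n) / n)
      pred_bound = 9 * A^2 * sigma^2 * s_n * d_n * log(m_n) / (kappa^2 * n)
      l1_bound   = 12 * A * sigma * s_n * d_n * sqrt(log(m_n) / n) / kappa^2
      prob       = 1 - m_n^(1 - A^2 / 8)
    """
    if A <= 2 * math.sqrt(2):
        raise ValueError("the theorem requires A > 2*sqrt(2)")
    rate = math.sqrt(math.log(m_n) / n)
    lam = A * sigma * rate
    pred_bound = 9 * A ** 2 * sigma ** 2 * s_n * d_n * math.log(m_n) / (kappa ** 2 * n)
    l1_bound = 12 * A * sigma * s_n * d_n * rate / kappa ** 2
    prob = 1 - m_n ** (1 - A ** 2 / 8)
    return lam, pred_bound, l1_bound, prob

# Hypothetical sizes: 500 samples, 1000 variables, 5 relevant groups of size at most 4.
lam, pred_bound, l1_bound, prob = theorem_4_3_quantities(
    n=500, m_n=1000, s_n=5, d_n=4, sigma=1.0, kappa=0.5
)
print(lam, pred_bound, l1_bound, prob)
```

Both bounds can be written in terms of $\lambda_n$ alone, as $9\lambda_n^2 s_n d_n/\kappa^2$ and $12\lambda_n s_n d_n/\kappa^2$, which makes the dependence on the tuning constant $A$ explicit.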

Remark 4.4. From this theorem, we obtain $\ell_1$-consistency and the corresponding rate of
convergence. Due to the fact that $\|\gamma\|_{\ell_q} \le \|\gamma\|_{\ell_1}$ for all $1 \le q \le \infty$, we also obtain
$\ell_q$-consistency if $s_n d_n \sqrt{\frac{\log m_n}{n}} \to 0$. If we want the rate of convergence for
$\ell_2$-consistency, a direct result is

$$
\|\hat\beta - \beta^*\|_{\ell_2}^2 \le \frac{144A^2\sigma^2}{\kappa^4}\, \frac{s_n^2 d_n^2 \log m_n}{n}
\tag{4.16}
$$

which is suboptimal. Recall that $\|\hat\beta - \beta^*\|_{\ell_1}^2 \le p_n d_n \|\hat\beta - \beta^*\|_{\ell_2}^2$. If $|S(\hat\beta)|$ is $O(s_n)$ and the
elements in $\hat\beta_j - \beta_j^*$ are balanced for $j \in S$, then we can also achieve the optimal rate
of convergence for $\ell_2$-norm consistency. How to obtain the optimal rate of convergence for
$\ell_q$-consistency for general $q$ would be interesting future work.
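Proof: From equation (4.2) with $\beta = \beta^*$, we have that on the event $\mathcal{A}$,

$$
\frac{1}{n}\|X\hat\beta - X\beta^*\|_{\ell_2}^2
\le 3\lambda_n \sum_{j \in S(\beta^*)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q}
\le 3\lambda_n \sqrt{s_n d_n}\, \sqrt{\sum_{j \in S(\beta^*)} (d_j)^{2/q'-1} \|\hat\beta_j - \beta_j^*\|_{\ell_q}^2}
\tag{4.17}
$$

$$
\sum_{j \in S(\beta^*)^c} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q}
\le 3 \sum_{j \in S(\beta^*)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q}.
\tag{4.18}
$$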

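By the last inequality, assumption 1 applies on the event $\mathcal{A}$. By this assumption, we
have that

$$
\frac{1}{n}\|X\hat\beta - X\beta^*\|_{\ell_2}^2 \ge \kappa^2 \sum_{j \in S(\beta^*)} (d_j)^{2/q'-1} \|\hat\beta_j - \beta_j^*\|_{\ell_q}^2.
\tag{4.19}
$$

By combining the above inequalities, we get

$$
\frac{1}{n}\|X\hat\beta - X\beta^*\|_{\ell_2}^2 \le \frac{9\lambda_n^2 s_n d_n}{\kappa^2}
\tag{4.20}
$$

and

$$
\sqrt{\sum_{j \in S(\beta^*)} (d_j)^{2/q'-1} \|\hat\beta_j - \beta_j^*\|_{\ell_q}^2} \le \frac{3\lambda_n \sqrt{s_n d_n}}{\kappa^2}.
\tag{4.21}
$$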



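Thus, we have

$$
\begin{aligned}
\|\hat\beta - \beta^*\|_{\ell_1}
&= \sum_{j=1}^{p_n} \|\hat\beta_j - \beta_j^*\|_{\ell_1}
 \le \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q} && (4.22) \\
&= \sum_{j \in S(\beta^*)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q}
 + \sum_{j \in S(\beta^*)^c} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q} && (4.23) \\
&\le 4 \sum_{j \in S(\beta^*)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j^*\|_{\ell_q}
 \le 4\sqrt{s_n d_n}\, \sqrt{\sum_{j \in S(\beta^*)} (d_j)^{2/q'-1} \|\hat\beta_j - \beta_j^*\|_{\ell_q}^2} && (4.24) \\
&\le \frac{12\lambda_n d_n s_n}{\kappa^2}
 = \frac{12A\sigma}{\kappa^2}\, s_n d_n \sqrt{\frac{\log m_n}{n}}. && (4.25)
\end{aligned}
$$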

Note that equation (4.20) is exactly equation (4.14). □
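The passage from a weighted sum over the relevant groups to a square-root form, used twice in the proof, is an application of the Cauchy-Schwarz inequality. A sketch of this step, writing $\Delta_j = \hat\beta_j - \beta_j^*$ and $S = S(\beta^*)$ as shorthand introduced here, and assuming that $d_n$ denotes the largest group size (so that $\sum_{j \in S} d_j \le s_n d_n$):

$$
\sum_{j \in S} (d_j)^{1/q'} \|\Delta_j\|_{\ell_q}
= \sum_{j \in S} (d_j)^{1/2} \cdot (d_j)^{1/q' - 1/2} \|\Delta_j\|_{\ell_q}
\le \sqrt{\sum_{j \in S} d_j}\ \sqrt{\sum_{j \in S} (d_j)^{2/q'-1} \|\Delta_j\|_{\ell_q}^2}
\le \sqrt{s_n d_n}\ \sqrt{\sum_{j \in S} (d_j)^{2/q'-1} \|\Delta_j\|_{\ell_q}^2}.
$$

The second factor is exactly the quantity controlled by assumption 1, which is why the weights $(d_j)^{2/q'-1}$ appear in its denominator.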

B. Oracle Inequalities for Prediction Error Under Misspecified Models 

Suppose the true regression function $f^*(X)$ is not linear, i.e. the model is misspecified. Then we
can no longer obtain the optimal rate of convergence directly. But we can still obtain a sparsity
oracle inequality, which bounds the prediction error in terms of the number of nonzero components
of the prediction oracle.
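Assumption 2 Assume $s'$ is an integer such that $1 \le s' \le p_n$, and $\delta$ is some positive
number. Then, for any $\gamma \ne 0$,

$$
\kappa(s', \delta) = \min_{S_0 \subset \{1,\ldots,p_n\}:\, |S_0| \le s'}\ \
\min_{\substack{\sum_{j \in S_0^c} (d_j)^{1/q'} \|\gamma_j\|_{\ell_q} \\ \le\, (2 + \frac{3}{\delta}) \sum_{j \in S_0} (d_j)^{1/q'} \|\gamma_j\|_{\ell_q}}}
\frac{\|X\gamma\|_{\ell_2}}{\sqrt{n}\,\sqrt{\sum_{j \in S_0} (d_j)^{2/q'-1} \|\gamma_j\|_{\ell_q}^2}} > 0.
$$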

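Theorem 4.5. Under assumption 2, let $\epsilon_1, \ldots, \epsilon_n$ be independent $N(0, \sigma^2)$ random
variables with $\sigma^2 > 0$. Consider the $\ell_1$-$\ell_q$ regularized estimator defined by (2.1) with

$$
\lambda_n = A\sigma\sqrt{\frac{\log m_n}{n}}
\tag{4.26}
$$

for some $A > 2\sqrt{2}$. Then, for all $n \ge 1$, with probability at least $1 - m_n^{1 - A^2/8}$ we have

$$
\frac{1}{n}\|X\hat\beta - f^*\|_{\ell_2}^2
\le (1 + \delta) \inf_{\beta \in \mathbb{R}^{m_n}:\, |S(\beta)| \le s'}
\left\{ \frac{1}{n}\|f^* - X\beta\|_{\ell_2}^2 + \frac{C(\delta) A^2 \sigma^2}{\kappa^2(s', \delta)}\, \frac{d_n |S(\beta)| \log m_n}{n} \right\}
\tag{4.27}
$$

where $C(\delta) > 0$ is a constant depending only on $\delta$, and $|S(\beta)|$ denotes the number of
elements in the set $S(\beta)$, i.e. the number of nonzero groups of $\beta$.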

Remark 4.6. From this sparsity oracle inequality, if we add some assumptions, such as that
there exists some $\beta'$ with $\frac{1}{n}\|f^* - X\beta'\|_{\ell_2}^2 \to 0$, then we can still obtain prediction error
consistency if $\frac{d_n |S(\beta')| \log m_n}{n} \to 0$. If we also want to obtain a convergence rate similar to
that in theorem 4.3, more conditions are needed, as shown in corollary 4.8.



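Proof: Fix an arbitrary $\beta \in \mathbb{R}^{m_n}$ with $|S(\beta)| \le s'$. On the event $\mathcal{A}$, we get from lemma 4.1
that

$$
\frac{1}{n}\|X\hat\beta - f^*\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}
\le \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 4\lambda_n \sum_{j \in S(\beta)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}.
\tag{4.28}
$$

Further, from the above, we get

$$
\begin{aligned}
\frac{1}{n}\|X\hat\beta - f^*\|_{\ell_2}^2
&\le \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n \sum_{j \in S(\beta)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} && (4.29) \\
&\le \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n \sqrt{d_n |S(\beta)|}\, \sqrt{\sum_{j \in S(\beta)} (d_j)^{2/q'-1} \|\hat\beta_j - \beta_j\|_{\ell_q}^2}. && (4.30)
\end{aligned}
$$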
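Consider separately the cases where

$$
3 \sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} \le \frac{\delta}{n}\|X\beta - f^*\|_{\ell_2}^2
\tag{4.31}
$$

and

$$
3 \sum_{j \in S(\beta)} \lambda_n (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} > \frac{\delta}{n}\|X\beta - f^*\|_{\ell_2}^2.
\tag{4.32}
$$

In case (4.31), the result of the theorem trivially follows from equation (4.28). So we will
only consider the case (4.32). All the subsequent inequalities are valid on the event $\mathcal{A} \cap \mathcal{A}_1$,
where $\mathcal{A}_1$ is defined by (4.32). On this event, we get from (4.28) that

$$
\sum_{j=1}^{p_n} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} \le 3\left(1 + \frac{1}{\delta}\right) \sum_{j \in S(\beta)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}
\tag{4.33}
$$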



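which further implies that

$$
\sum_{j \in S(\beta)^c} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q} \le \left(2 + \frac{3}{\delta}\right) \sum_{j \in S(\beta)} (d_j)^{1/q'} \|\hat\beta_j - \beta_j\|_{\ell_q}.
\tag{4.34}
$$

By assumption 2, we have

$$
\kappa^2(s', \delta) \sum_{j \in S(\beta)} (d_j)^{2/q'-1} \|\hat\beta_j - \beta_j\|_{\ell_q}^2 \le \frac{1}{n}\|X\hat\beta - X\beta\|_{\ell_2}^2.
\tag{4.35}
$$

Combining this with (4.30), we get

$$
\begin{aligned}
\frac{1}{n}\|X\hat\beta - f^*\|_{\ell_2}^2
&\le \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n \kappa^{-1}(s', \delta) \sqrt{d_n |S(\beta)|}\, \frac{1}{\sqrt{n}}\|X\hat\beta - X\beta\|_{\ell_2} && (4.36) \\
&\le \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + 3\lambda_n \kappa^{-1}(s', \delta) \sqrt{d_n |S(\beta)|} \left( \frac{1}{\sqrt{n}}\|X\hat\beta - f^*\|_{\ell_2} + \frac{1}{\sqrt{n}}\|X\beta - f^*\|_{\ell_2} \right). && (4.37)
\end{aligned}
$$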



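This inequality is of the same form as (A.4) in (Bunea et al., 2007a). A standard decoupling
argument as in (Bunea et al., 2007a), using the inequality $2xy \le \frac{x^2}{b} + b y^2$ with $b > 1$,
$x = \lambda_n \kappa^{-1}(s', \delta) \sqrt{d_n |S(\beta)|}$, and $y$ being either $\frac{1}{\sqrt{n}}\|X\hat\beta - f^*\|_{\ell_2}$ or $\frac{1}{\sqrt{n}}\|X\beta - f^*\|_{\ell_2}$, yields that

$$
\frac{1}{n}\|X\hat\beta - f^*\|_{\ell_2}^2 \le \frac{b+1}{b-1}\, \frac{1}{n}\|X\beta - f^*\|_{\ell_2}^2 + \frac{9 b^2}{2(b-1)}\, \frac{\lambda_n^2 d_n |S(\beta)|}{\kappa^2(s', \delta)}.
\tag{4.38}
$$

Taking $b = 1 + 2/\delta$ in the last display, so that $\frac{b+1}{b-1} = 1 + \delta$, finishes the proof of the theorem. □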
From the above sparsity oracle inequality, we can show that the $\ell_1$-$\ell_q$ regression estimator can
achieve the optimal rate of convergence if some "weak sparsity" condition holds (Bunea et al.,
2007b). The main intuition is that, even if the true function $f^*$ cannot be represented exactly
by a linear model $X\beta$, the optimal rate of convergence can still be achieved provided that, for
some $\beta \in \mathbb{R}^{m_n}$, the squared distance from $f^*$ to $X\beta$ can be controlled, up to logarithmic
factors, by $|S(\beta)|/n$. More formally, we define an oracle set as follows.

Definition 4.7. Let $B$ be a constant depending only on $f^*$ and define an oracle set as

$$
\mathcal{B} = \left\{ \beta : \ \frac{1}{n}\|f^* - X\beta\|_{\ell_2}^2 \le B \lambda_n^2 |S(\beta)| \right\}.
\tag{4.39}
$$



Corollary 4.8. Under the same conditions as in theorem 4.5, if the oracle set $\mathcal{B}$ is nonempty
and there is at least one element $\beta \in \mathcal{B}$ such that $|S(\beta)| \le s'$, we have

$$
\frac{1}{n}\|X\hat\beta - f^*\|_{\ell_2}^2 = O\left( \frac{s' d_n \log m_n}{n} \right).
$$

Therefore, when $s' \le s_n$, the $\ell_1$-$\ell_q$ regression estimator achieves the optimal rate of
convergence.



Remark 4.9. Generally, the conditions for estimation consistency are weaker than those
for variable selection consistency. For $q = 1$, the reason why assumptions 1 and 2 are weaker than the
assumptions in theorem 3.1 can be found in (Meinshausen and Yu, 2006) and (Bickel et al.,
2007). The cases for $q > 1$ and the group cases should follow in a similar way.



V. Risk Consistency 

In this section, we study the risk consistency (or persistency) property with random design,
which holds under a much weaker condition than variable selection consistency and does
not need the true model to be linear. Instead of directly showing the persistency result
for the estimator defined in equation (2.1), we show it for a constrained-form
estimator, which is equivalent to the estimator in (2.1) in the sense of primal and dual
problems.

Because of the random design and increasing dimensions, the same triangular array
statistical paradigm as in (Greenshtein and Ritov, 2004) is adopted. In the following, we
use calligraphic letters, such as $\mathcal{Z}$, to represent random variables, and $Z$ to represent their
realizations. Consider the triangular array $\mathcal{Z}_1^{(n)}, \ldots, \mathcal{Z}_n^{(n)}$ (which is simplified as $\mathcal{Z}_1, \ldots, \mathcal{Z}_n$).
Our study mainly focuses on the case where $\mathcal{Z}_1, \ldots, \mathcal{Z}_n \sim F_n \in \mathcal{F}_n$, where $\mathcal{F}_n$ is a collection
of distributions of $(m_n + 1)$-dimensional i.i.d. random vectors

$$
\mathcal{Z}_i = (\mathcal{Y}_i, \mathcal{X}_{i,1}, \ldots, \mathcal{X}_{i,m_n}), \qquad i = 1, \ldots, n
\tag{5.1}
$$

with the corresponding realizations

$$
Z_i = (Y_i, X_{i,1}, \ldots, X_{i,m_n}), \qquad i = 1, \ldots, n.
\tag{5.2}
$$



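Denote

$$
\gamma = (-1, \beta_1, \ldots, \beta_{m_n}) = (\beta_0, \beta_1, \ldots, \beta_{m_n}),
\tag{5.3}
$$

and define

$$
R_{F_n}(\beta) = \mathbb{E}\Big( \mathcal{Y} - \sum_{j=1}^{m_n} \mathcal{X}_j \beta_j \Big)^2 = \gamma^T \Sigma_{F_n} \gamma
\tag{5.4}
$$

where $\mathcal{Z} = (\mathcal{Y}, \mathcal{X}_1, \ldots, \mathcal{X}_{m_n}) \sim F_n \in \mathcal{F}_n$ and $\Sigma_{F_n} = \mathbb{E}\, \mathcal{Z}^T \mathcal{Z}$.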

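Given $n$ observations $Z_1, \ldots, Z_n$, denote their empirical distribution by $\hat F_n$ and define the
empirical risk as

$$
R_{\hat F_n}(\beta) = \gamma^T \Sigma_{\hat F_n} \gamma
\tag{5.5}
$$

where $\Sigma_{\hat F_n} = \frac{1}{n} \sum_{i=1}^{n} Z_i^T Z_i$.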
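The identity between the empirical risk and the quadratic form in $\gamma$ can be verified on simulated data. The sketch below is an illustration only: it assumes the empirical second-moment matrix is $\frac{1}{n}\sum_i Z_i^T Z_i$ with rows $Z_i = (Y_i, X_{i,1}, \ldots, X_{i,m})$, and the helper names, the seed and the simulated model are hypothetical.

```python
import random

def empirical_second_moment(Z):
    """(1/n) * sum_i Z_i^T Z_i for rows Z_i = (Y_i, X_i1, ..., X_im)."""
    n, k = len(Z), len(Z[0])
    return [[sum(z[a] * z[b] for z in Z) / n for b in range(k)] for a in range(k)]

def quad_form(gamma, S):
    """gamma^T S gamma for a square matrix S given as a list of rows."""
    k = len(gamma)
    return sum(gamma[a] * S[a][b] * gamma[b] for a in range(k) for b in range(k))

random.seed(0)
n, m = 50, 3
beta = [0.5, -1.0, 2.0]  # hypothetical coefficient vector

# Simulate rows Z_i = (Y_i, X_i1, ..., X_im) from a noisy linear model.
Z = []
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(m)]
    y = sum(b * xi for b, xi in zip(beta, x)) + random.gauss(0.0, 0.1)
    Z.append([y] + x)

gamma = [-1.0] + beta  # gamma = (-1, beta_1, ..., beta_m)

risk_quadratic = quad_form(gamma, empirical_second_moment(Z))
risk_direct = sum(
    (z[0] - sum(b * xi for b, xi in zip(beta, z[1:]))) ** 2 for z in Z
) / n

print(risk_quadratic, risk_direct)
```

The two numbers agree up to floating-point error because $\gamma^T Z_i^T = -(Y_i - \sum_j X_{i,j}\beta_j)$, so the quadratic form is the mean squared residual.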



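Given a sequence of sets of predictors $B_n = \{ \beta : \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} \le L_n \}$, the sequence of
estimators $\hat\beta^{\hat F_n}$ is called persistent if for every sequence $F_n \in \mathcal{F}_n$,

$$
R_{F_n}(\hat\beta^{\hat F_n}) - R_{F_n}(\beta_*^{F_n}) \xrightarrow{P} 0,
\tag{5.6}
$$

where

$$
\hat\beta^{\hat F_n} = \arg\min_{\beta \in B_n} R_{\hat F_n}(\beta) = \arg\min_{\beta \in B_n} \|Y - X\beta\|_{\ell_2}^2
\tag{5.7}
$$

$$
\beta_*^{F_n} = \arg\min_{\beta \in B_n} R_{F_n}(\beta).
\tag{5.8}
$$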



To show the persistency result, a moment condition as in (Zhou et al., 2007) is needed.

Assumption 3 For each $j, k \in \{1, \ldots, m_n + 1\}$, denote $E = (\mathcal{Z}\mathcal{Z}^T - \mathbb{E}(\mathcal{Z}\mathcal{Z}^T))_{j,k}$, where
$\mathcal{Z} = (\mathcal{Y}, \mathcal{X}_1, \ldots, \mathcal{X}_{m_n})$. Suppose that there exist some constants $M$ and $s$ such that

$$
\mathbb{E}(|E|^q) \le q!\, M^{q-2} s / 2
\tag{5.9}
$$

for every $q \ge 2$ and every $F_n \in \mathcal{F}_n$.






Theorem 5.1. Suppose that $m_n \le e^{n^\xi}$ for some $\xi < 1$. If $L_n = o\big((n/\log n)^{1/4}\big)$, then $\ell_1$-$\ell_q$
regularized regression is persistent. That is, for every sequence $F_n \in \mathcal{F}_n$:

$$
R_{F_n}(\hat\beta^{\hat F_n}) - R_{F_n}(\beta_*^{F_n}) = o_P(1).
\tag{5.10}
$$

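Proof: For any $j, k \in \{1, \ldots, m_n + 1\}$ and any $\delta > 0$, from assumption 3 we can apply
Bernstein's inequality and obtain

$$
P\big( |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| > \delta \big) \le e^{-cn\delta^2}
\tag{5.11}
$$

for some $c > 0$. Therefore, by the Bonferroni bound we have

$$
P\Big( \max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| > \delta \Big) \le m_n^2 e^{-cn\delta^2} \le e^{2n^\xi - cn\delta^2} \le e^{-cn\delta^2/2}
\tag{5.12}
$$

for large enough $n$. For a sequence $\delta_n = \sqrt{\frac{2\log n}{cn}}$ we have

$$
P\Big( \max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| > \delta_n \Big) \le \frac{1}{n} \to 0
\tag{5.13}
$$

which implies that

$$
\max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| = O_P\left( \sqrt{\frac{\log n}{n}} \right).
\tag{5.14}
$$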

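Therefore,

$$
\begin{aligned}
\sup_{\beta \in B_n} |R_{F_n}(\beta) - R_{\hat F_n}(\beta)|
&= \sup_{\beta \in B_n} |\gamma^T (\Sigma_{F_n} - \Sigma_{\hat F_n}) \gamma| && (5.15) \\
&\le \max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| \ \sup_{\beta \in B_n} \|\gamma\|_{\ell_1}^2 && (5.16) \\
&= \max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| \ \sup_{\beta \in B_n} \Big( 1 + \sum_{j=1}^{p_n} \|\beta_j\|_{\ell_1} \Big)^2 && (5.17) \\
&\le \max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| \ \sup_{\beta \in B_n} \Big( 1 + \sum_{j=1}^{p_n} (d_j)^{1/q'} \|\beta_j\|_{\ell_q} \Big)^2 && (5.18) \\
&\le \max_{j,k} |(\Sigma_{\hat F_n})_{j,k} - (\Sigma_{F_n})_{j,k}| \, (1 + L_n)^2 = o_P(1)
\end{aligned}
$$

for $L_n = o\big((n/\log n)^{1/4}\big)$.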
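The key step in this chain is the elementary bound $|\gamma^T M \gamma| \le \max_{j,k}|M_{j,k}| \cdot \|\gamma\|_{\ell_1}^2$, applied with $M$ equal to the difference of the two second-moment matrices. A minimal numerical sketch of that bound follows. The matrix here is random and merely stands in for that difference, and all names and values are hypothetical.

```python
import random

def quad_form(gamma, M):
    """gamma^T M gamma for a square matrix M given as a list of rows."""
    k = len(gamma)
    return sum(gamma[a] * M[a][b] * gamma[b] for a in range(k) for b in range(k))

random.seed(1)
k = 5

# Stand-in for the difference of the population and empirical matrices.
M = [[random.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(k)]

# gamma = (-1, beta_1, ..., beta_{k-1}), so ||gamma||_1 = 1 + ||beta||_1.
beta = [random.uniform(-2.0, 2.0) for _ in range(k - 1)]
gamma = [-1.0] + beta

lhs = abs(quad_form(gamma, M))
max_entry = max(abs(v) for row in M for v in row)
l1_gamma = sum(abs(g) for g in gamma)
rhs = max_entry * l1_gamma ** 2

print(lhs, rhs)
```

Since $\|\gamma\|_{\ell_1} = 1 + \|\beta\|_{\ell_1}$ and the weighted group norm dominates the $\ell_1$ norm, the constraint defining $B_n$ bounds the right-hand side by the largest entrywise deviation times $(1 + L_n)^2$.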



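Further, by definition, we have $R_{\hat F_n}(\hat\beta^{\hat F_n}) \le R_{\hat F_n}(\beta_*^{F_n})$. Combining this with the following
inequalities

$$
R_{F_n}(\hat\beta^{\hat F_n}) - R_{\hat F_n}(\hat\beta^{\hat F_n}) \le \sup_{\beta \in B_n} |R_{F_n}(\beta) - R_{\hat F_n}(\beta)|
\tag{5.19}
$$

$$
R_{\hat F_n}(\beta_*^{F_n}) - R_{F_n}(\beta_*^{F_n}) \le \sup_{\beta \in B_n} |R_{F_n}(\beta) - R_{\hat F_n}(\beta)|
\tag{5.20}
$$

implies that

$$
R_{F_n}(\hat\beta^{\hat F_n}) - R_{F_n}(\beta_*^{F_n}) \le 2 \sup_{\beta \in B_n} |R_{F_n}(\beta) - R_{\hat F_n}(\beta)| = o_P(1),
\tag{5.21}
$$

which completes the proof. □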



VI. Discussions 



The results presented here show that many good properties of $\ell_1$-regularization (Lasso)
naturally carry over to the $\ell_1$-$\ell_q$ cases ($1 \le q \le \infty$), even if the number of variables within
each group also increases with the sample size $n$. Using fixed design, we get both variable
selection and estimation consistency under different conditions. Using random design, we
get persistency under a much weaker condition. Our results provide a unified treatment for
both the iCAP estimator ($q = \infty$) and the group Lasso estimator ($q = 2$).



Our results can also provide a theoretical analysis of the simultaneous Lasso estimator (Turlach et al.,
2005; Tropp et al., 2006) for joint sparsity, which finds a good approximation of several
response variables at once using different linear combinations of the high-dimensional
covariates. At the same time, it tries to balance the error in approximation against the total
number of covariates that participate. Assume that we have altogether $d_n$ responses, the
$i$-th signal is represented as $Y^{(i)} \in \mathbb{R}^{n}$, and the design matrix is $X = (X_1, \ldots, X_{p_n}) \in \mathbb{R}^{n \times p_n}$.
Denote the model as

$$
Y^{(i)} = X\beta^{(i)} + \epsilon^{(i)}, \qquad i = 1, \ldots, d_n.
\tag{6.1}
$$



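The simultaneous Lasso estimator can be formulated as

$$
\hat\beta^{(1)}, \ldots, \hat\beta^{(d_n)} = \arg\min_{\beta^{(1)}, \ldots, \beta^{(d_n)}} \frac{1}{2n} \sum_{i=1}^{d_n} \|Y^{(i)} - X\beta^{(i)}\|_{\ell_2}^2 + \lambda_n \sum_{j=1}^{p_n} \max_{i} |\beta_j^{(i)}|.
\tag{6.2}
$$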



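This problem can be formulated as a standard $\ell_1$-$\ell_q$ regularized regression estimator with
$q = \infty$. For this, define

$$
\tilde Y = \begin{pmatrix} Y^{(1)} \\ \vdots \\ Y^{(d_n)} \end{pmatrix} \in \mathbb{R}^{n d_n \times 1}, \qquad
\tilde X = I_{d_n} \otimes X = \begin{pmatrix} X & & \\ & \ddots & \\ & & X \end{pmatrix}, \qquad \text{and} \qquad
\tilde\beta = \begin{pmatrix} \beta^{(1)} \\ \vdots \\ \beta^{(d_n)} \end{pmatrix}
\tag{6.3}
$$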



where $\otimes$ denotes the Kronecker product. Therefore, the simultaneous Lasso estimator can
be rewritten as

$$
\hat\beta^{(1)}, \ldots, \hat\beta^{(d_n)} = \arg\min_{\beta^{(1)}, \ldots, \beta^{(d_n)}} \frac{1}{2n} \|\tilde Y - \tilde X \tilde\beta\|_{\ell_2}^2 + \lambda'_n \sum_{j=1}^{p_n} (d_n) \max_{l \in \{1, \ldots, d_n\}} |\beta_j^{(l)}|
\tag{6.4}
$$

where $\lambda'_n = \lambda_n / d_n$. This is just an $\ell_1$-$\ell_\infty$ regularized regression estimator with block design.
Therefore, all results in this paper can be applied to analyze estimators of this type.
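The stacking construction can be illustrated on a toy example. The sketch below builds the block-diagonal design $I_{d} \otimes X$ explicitly, checks that the stacked squared residual equals the sum of the per-response squared residuals, and evaluates the penalty $\sum_j \max_i |\beta_j^{(i)}|$. It is a minimal sketch with made-up data. The helper names are hypothetical, and the stacking order (response by response) is an assumption.

```python
def kron_identity(X, d):
    """Return I_d (Kronecker) X as a dense block-diagonal list of rows."""
    n, p = len(X), len(X[0])
    out = [[0.0] * (d * p) for _ in range(d * n)]
    for blk in range(d):
        for r in range(n):
            for c in range(p):
                out[blk * n + r][blk * p + c] = X[r][c]
    return out

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def sq_norm(v):
    return sum(x * x for x in v)

# Toy data: n = 3 observations, p = 2 covariates, d = 2 responses.
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
betas = [[1.0, 0.5], [-2.0, 0.0]]            # beta^(1), beta^(2)
Ys = [[2.1, 0.4, 2.4], [-2.2, 0.1, -5.9]]    # Y^(1), Y^(2)

# Stack responses and coefficients response by response.
Y_tilde = [y for Y in Ys for y in Y]
beta_tilde = [b for beta in betas for b in beta]
X_tilde = kron_identity(X, len(betas))

stacked = sq_norm([y - f for y, f in zip(Y_tilde, matvec(X_tilde, beta_tilde))])
separate = sum(
    sq_norm([y - f for y, f in zip(Y, matvec(X, beta))])
    for Y, beta in zip(Ys, betas)
)

# Group j collects the j-th coefficient of every response, so with this
# stacking order its entries sit at positions j, p + j, 2p + j, ... of beta_tilde.
p = len(X[0])
penalty = sum(max(abs(beta[j]) for beta in betas) for j in range(p))

print(stacked, separate, penalty)
```

With this ordering each group is non-contiguous in the stacked coefficient vector. Reordering the columns of the stacked design so that each group is contiguous gives an equivalent problem.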